Abstract

It has been demonstrated that if two individual sequences are independent realizations of two
finite-order, finite alphabet, stationary Markov processes, an empirical divergence measure (ZMM) 
that is based on cross-parsing of one sequence relative to the second one converges to the relative 
entropy almost surely. This leads to a realization of an empirical, linear complexity universal classifier 
which is asymptotically optimal in the sense that the probability of classification error vanishes as the 
length of the sequence tends to infinity if the KL-divergence between the two processes is positive. It 
is demonstrated that a version of the ZMM is not only asymptotically optimal as the length of the 
sequences tends to infinity, but is also essentially-optimal for a class of finite-length sequences that 
are realizations of finite-alphabet, vanishing memory processes with positive transitions in the sense 
that the probability of classification error vanishes if the length of the sequences is larger than some 
positive integer $N_0$ and leads to an asymptotically optimal classification algorithm. At the same time
no universal classifier can yield an efficient discrimination between any two distinct processes in this
class, if the length of the two sequences $N$ is such that $\log N < \log N_0$, even if the KL-divergence
between the two processes is positive. It is further demonstrated that not every asymptotically optimal 
universal classification algorithm is also essentially optimal. 

A variable length (VL) divergence that converges to the KL-divergence when the length of the
sequences tends to infinity is defined. Another universal classification algorithm which, like ZMM, is
also based on cross-parsing, is shown to be optimal relative to the VL divergence (rather than being
just essentially optimal) for any two finite-length sequences that are realizations of vanishing-memory
processes.

Index terms : universal classification, universal data-compression. 
1 Introduction, notations and definitions



A device called a classifier (or discriminator) observes two $N$-sequences whose probability laws
are $Q$ and $P$ respectively ($Q$ and $P$ are defined on doubly infinite sequences in a finite alphabet
$\mathbf{A}$). Both $Q$ and $P$ are unknown. The classifier's task is to decide whether $P = Q$, or $P$ and $Q$
are sufficiently different according to some appropriate criterion $\Delta$. If the classifier has available
an infinite amount of training data (i.e. if $N$ is large enough), this is a simple matter. However,
here we study the case where $N$ is finite.

The results in this paper are a generalization of the results in [1] to finite-length test sequences
rather than infinite ones.

Consider random sequences from a finite alphabet $\mathbf{A}$, where $|\mathbf{A}| < \infty$. Denote $\ell$-vectors
from $\mathbf{A}$ by $z_1^\ell = z_1, \ldots, z_\ell \in \mathbf{A}^\ell$, and use upper case $Z$'s to denote random variables. When the
superscript is clear from the context, it will be omitted. Similarly, a substring $Z_i, \ldots, Z_j$; $-\infty <
i < j < +\infty$ is denoted by $Z_i^j$.

Let a class of "vanishing memory" processes $\mathcal{M}$ be defined as follows:

$\mathcal{M} = \mathcal{M}_{k_0,\beta,\delta}$ is the set of probability measures on doubly infinite sequences from the set $\mathbf{A}$,
with the following properties:

A) Positive transitions property:

$$P(X_1 = z_1 \mid X_{-\infty}^{0} = z_{-\infty}^{0}) \ge \delta > 0$$

for all sequences $z_{-\infty}^{1}$ and for every $P \in \mathcal{M}$.

B) Strong mixing condition (following [2], Eq. (9)):

Let $\{X_i\}$, $-\infty < i < \infty$, be a random sequence with probability law $P \in \mathcal{M}$. We further assume
that $\{X_i\}$ is a stationary ergodic process where every member in $\mathcal{M}$ satisfies the following
condition:

Condition 1 Let $\sigma(X_i^j)$; $-\infty \le i \le j \le +\infty$ be the $\sigma$-field generated by the subsequence $X_i^j$.
Then, there exists an integer $k_0$ such that for all $k > k_0$, all $A \in \sigma(X_{-\infty}^{0})$ and all $B \in \sigma(X_k^{\infty})$,

$$\frac{1}{\beta} \le \frac{P(B)}{P(B|A)} \le \beta \qquad (1)$$

for $P(A), P(B) > 0$ and $\beta > 1$.

C) $P[X_1^\ell : P(X_1^\ell) < 2^{-R\ell}] < \alpha$ for every $P \in \mathcal{M}_\ell = \mathcal{M}(\alpha, \beta, k_0, \ell)$.

The constants $k_0$, $\beta$, $R$ and $\ell$ do not depend on $P$.

The condition in B) is reminiscent of $\phi$-mixing but is not identical to it. We remark that if $P$ is
any irreducible, aperiodic finite-order Markov process, this condition will be satisfied. Furthermore,
the "positive transitions" condition may be guaranteed by dithering prior to the classification
process, without violating the strong mixing condition. The condition in C) is satisfied by any
ergodic process for some $\ell$, by the Asymptotic Equipartition Property (AEP) of information theory.
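As a rough illustration of the remark that irreducible, aperiodic Markov processes satisfy the mixing condition, the sketch below checks the ratio bound for a first-order chain. For such a chain the ratio $P(B|A)/P(B)$ is a weighted average of the values $T^k(x,y)/\pi(y)$, where $T$ is the transition matrix and $\pi$ its stationary law, so it is enough to inspect the $k$-step matrix. The function names, the example matrix and the value $\beta = 2$ are assumptions made for illustration; they do not appear in the paper. The function returns only the first lag at which the bound holds and does not by itself prove the bound for all larger lags.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def stationary(t, iters=10_000):
    """Stationary distribution of transition matrix t by power iteration."""
    n = len(t)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * t[i][j] for i in range(n)) for j in range(n)]
    return pi


def mixing_lag(t, beta, k_max=1000):
    """First lag k with 1/beta <= T^k(x, y) / pi(y) <= beta for all x, y.

    Hypothetical helper: returns None if no such lag is found up to k_max.
    """
    pi = stationary(t)
    n = len(t)
    tk = t  # k-step transition matrix, starting with k = 1
    for k in range(1, k_max + 1):
        ratios = [tk[x][y] / pi[y] for x in range(n) for y in range(n)]
        if max(ratios) <= beta and min(ratios) >= 1.0 / beta:
            return k
        tk = mat_mul(tk, t)
    return None


# Example two-state chain with positive transitions (illustrative values).
T = [[0.9, 0.1],
     [0.2, 0.8]]
print(mixing_lag(T, beta=2.0))
```

A chain with a zero entry in $T$ can still pass this check at some lag, but it would violate the positive transitions property A), which is why the text mentions dithering.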

2 Statement of results 

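Let the normalized $N$-th order K-L divergence between $Q$ and $P \in \mathcal{M}$ be:

$$D_N(P\|Q) = \frac{1}{N} D(P^N\|Q^N) = \frac{1}{N}\sum_{z_1^N \in \mathbf{A}^N} P^N(z_1^N)\log\frac{P^N(z_1^N)}{Q^N(z_1^N)}$$

where $P^N$, $Q^N$ are the $N$-dimensional marginal measures of $P$, $Q$, and $D(\cdot\|\cdot)$ denotes the conventional
Kullback-Leibler divergence. Logarithms are taken to base 2 and obey $0 \log 0 = 0$.

Note that due to the positive transitions property of the collection $\mathcal{M}$, $D_N(P\|Q) \le \log\frac{1}{\delta} < \infty$
for every $Q \in \mathcal{M}$.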
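The normalized divergence can be evaluated by brute force for very small $N$. The following is a minimal sketch for two first-order Markov measures, assuming each is given by an initial distribution and a transition matrix; the names `seq_prob` and `normalized_kl` and the numerical values are illustrative only. It enumerates all $|\mathbf{A}|^N$ sequences, so it is practical only for tiny $N$, and it assumes $Q$ has positive transitions so that the ratio is finite.

```python
from itertools import product
from math import log2


def seq_prob(z, init, trans):
    """Probability of the sequence z under a first-order Markov measure."""
    p = init[z[0]]
    for a, b in zip(z, z[1:]):
        p *= trans[a][b]
    return p


def normalized_kl(n, init_p, trans_p, init_q, trans_q):
    """(1/n) * sum over all n-sequences of P(z) * log2(P(z) / Q(z))."""
    total = 0.0
    for z in product(range(len(init_p)), repeat=n):
        pz = seq_prob(z, init_p, trans_p)
        if pz > 0.0:  # convention: 0 log 0 = 0
            total += pz * log2(pz / seq_prob(z, init_q, trans_q))
    return total / n


# Two memoryless binary sources written as degenerate Markov chains.
p_init, p_trans = [0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]
q_init, q_trans = [0.25, 0.75], [[0.25, 0.75], [0.25, 0.75]]
print(normalized_kl(3, p_init, p_trans, q_init, q_trans))
```

For memoryless sources the value does not depend on $N$, which gives a quick sanity check of the normalization by $1/N$.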

The asymptotic K-L divergence between $Q$ and $P$ is given by:

$$D(P\|Q) = \limsup_{N\to\infty} D_N(P\|Q) \qquad (2)$$

Formally, given an $N$-sequence $Y$ which is a realization of $Q$ and another $N$-sequence $X$ which
is a realization of $P$, we define a classifier $f_c$ (c for "classifier") as a mapping of $(X, Y)$ to $\{0, 1\}$,

$$f_c : \mathbf{A}^N \times \mathbf{A}^N \to \{0, 1\}$$

where $f_c = 1$ declares $Q$ to be different from $P$, and $f_c = 0$ means $Q = P$.



For any collection $M \subset \mathcal{M}$ of probability measures $P_i$; $1 \le i \le |M|$, define

$$\lambda(P_i,\Delta,M) = P_i[(X, Y) : \text{either } f_c(X, Y) = 1 \text{ and } P_j = P_i, \text{ or for some } P_j : D(P_i\|P_j) > \Delta,\ f_c(X, Y) = 0] \qquad (3)$$

where $\Delta$ is a fidelity criterion.
Also, let

$$\lambda(M) = \sup_{P_i \in M} \lambda(P_i,\Delta,M) \qquad (4)$$

We seek classifiers $f_c(X, Y)$ which are derived from two "training sequences" $Y$ and $X$ of length
$N$ and which will make $\lambda(M)$, the classification error, small for any $M \subset \mathcal{M}$.

A classifier $f_c$ is said to be asymptotically optimal if the probability of classification error tends
to zero for every $M \subset \mathcal{M}$ as the length of the two sequences tends to infinity.

The efficiency of different universal classifiers that are asymptotically optimal should also be
judged by the rate at which the corresponding classification error tends to zero as $N$ increases,
since, after all, one has to deal with finite-length sequences.

In order to evaluate the efficiency of a universal classifier for finite-length sequences, we may
consider an appropriate fidelity function $F(P^N, Q^N)$ other than $F(P^N, Q^N) = D_N(P\|Q)$, as
long as it converges to the "classical" KL-fidelity function $D(P\|Q)$ as $N$ tends to infinity.

Hence, we limit the discussion to the class $\mathcal{F}$ of fidelity functions $F(P^N, Q^N)$ such that

$$\limsup_{N\to\infty} F(P^N, Q^N) = D(P\|Q) \qquad (5)$$

almost surely, where $D(P\|Q)$ is the K-L divergence.

Now, given a particular fidelity function $F \in \mathcal{F}$, and a collection $M \subset \mathcal{M}$ of probability
measures $P_i$, $1 \le i \le |M|$, assume that $X$ is a realization of $P_i$ and that $Y$ is a realization of $P_j$,
and define

$$\lambda_F(P_i,\Delta,M) = P_i[(X, Y) : \text{either } f_c(X, Y) = 1 \text{ and } P_j = P_i, \text{ or for some } P_j : F(P_i, P_j) > \Delta,\ f_c(X, Y) = 0] \qquad (6)$$

where $\Delta$ is a fidelity criterion.
Also, let

$$\lambda_F(M) = \sup_{P_i \in M} \lambda_F(P_i,\Delta,M) \qquad (7)$$

Hence, every classifier that utilizes a fidelity function $F \in \mathcal{F}$ is asymptotically optimal.

Let us first start with the classical case where $F(P^N, Q^N) = D_N(P\|Q)$. Let $\lambda(M) = \lambda_F(M)$
for this particular fidelity function.

Following [2], it is shown in Theorem 1 below that the classification error that is associated with
any universal classifier that has only the two sequences $X$ and $Y$ at its disposal is close to one
for the class $\mathcal{M}_\ell$ if $N < N_0 2^{-\epsilon\ell}$, where $N_0 = 2^{R\ell}$ and where $R$ and $\ell$ are the parameters that define
the class of processes $\mathcal{M}$. But is there an optimal universal algorithm that will yield a vanishing
classification error probability $\lambda(M)$ for $N > N_0 2^{\epsilon\ell}$?

An asymptotically optimal classifier $f_c$ is said to be also $F$-optimal over $\mathcal{M}$ for finite-length
sequences if the probability of classification error $\lambda_F(M)$ becomes negligible for training sequences
longer than or equal to $N_0 2^{\epsilon\ell}$, where $N_0$ is some positive integer such that any universal classifier
will yield a probability of classification error $\lambda_F(M)$ which is close to one if the length of the
sequences $N < N_0 2^{-\epsilon\ell}$. The description of such an $F$-optimal universal classifier appears in
Section 2 below.

Apparently, not every fidelity function $F \in \mathcal{F}$ leads to an associated $F$-optimal universal
classifier.

An asymptotically optimal classifier $f_c$ is said to be also essentially optimal over $\mathcal{M}$ for finite-length
sequences if there exists a collection $M \subset \mathcal{M}$ of pairs $P, Q : D_N(P\|Q) > \Delta$ for which the
probability of classification error $\lambda(M)$ is close to one for $N < N_0 2^{-\epsilon\ell}$ and becomes negligible for
sequences longer than or equal to $N_0 2^{\epsilon\ell}$.

It should be noted in passing that if one of the probability measures, $Q$, is fully known to the
classifier, if the sequence $Y$ is of length $\ell$ and if the fidelity criterion is $D_\ell(P\|Q) > \Delta$, there is indeed
a classifier $f_c(X, Q)$ that is essentially optimal over the whole class $\mathcal{M}$ and is therefore optimal.
This follows from the fact that the measure $P^\ell$ of highly probable $\ell$-vectors can be well estimated
from $X$ once $N > N_0 2^{\epsilon\ell}$. Hence one can generate a good estimate for $D_\ell(P\|Q)$. However, if, as
in our case, $Q$ is not known and the classification is based only on the observed vectors $X$ and $Y$,
this need not be the case any more, since no good empirical estimate for $Q$-improbable vectors may be
generated from $Y$ unless its length becomes much larger.

It is demonstrated that a ZMM-based classifier is asymptotically optimal as well as essentially
optimal relative to the fidelity function $D_N(P\|Q)$.

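A common classifier is the Empirical Statistics Classifier (ESC), where

$$d(X, Y) = \frac{1}{N}\sum_{i=0}^{T-1} \log\frac{\hat{P}_X(X_{in+1}^{(i+1)n})}{\hat{Q}_Y(X_{in+1}^{(i+1)n})}$$

where $\hat{P}_X(Z_1^n)$; $Z_1^n \in \mathbf{A}^n$ denotes an empirically derived estimate of the probability of $n$-vectors in
$X$ (and $\hat{Q}_Y$ the corresponding estimate derived from $Y$), where $n = \delta_0 \log N$, $0 < \delta_0 \ll 1$, and where $T$ is an integer satisfying $Tn \le N < (T + 1)n$.

Also, let $f_c(X, Y) = 1$ if $d(X, Y) > \frac{\Delta}{2}$ and $f_c(X, Y) = 0$ if $d(X, Y) < \frac{\Delta}{2}$.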
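The ESC can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the estimator analyzed in the paper: block probabilities are estimated here by add-one smoothed sliding-window frequencies (the later analysis in the text uses recurrence-time estimates instead), and the names `block_estimate`, `esc_distance` and `esc_classify` are hypothetical. The labels follow the rule stated just above: 1 when the empirical distance exceeds $\Delta/2$, otherwise 0.

```python
from collections import Counter
from math import log2


def block_estimate(seq, n, alphabet_size):
    """Add-one smoothed relative frequency of n-blocks over sliding windows.

    The smoothing is an assumption made here to avoid division by zero.
    """
    windows = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(windows)
    total = len(windows) + alphabet_size ** n
    return lambda block: (counts[tuple(block)] + 1) / total


def esc_distance(x, y, n, alphabet_size=2):
    """(1/N) * sum over the T complete n-blocks of x of log2(P_hat / Q_hat)."""
    p_hat = block_estimate(x, n, alphabet_size)
    q_hat = block_estimate(y, n, alphabet_size)
    t = len(x) // n  # number of complete, non-overlapping n-blocks in x
    total = 0.0
    for i in range(t):
        block = x[i * n:(i + 1) * n]
        total += log2(p_hat(block) / q_hat(block))
    return total / len(x)


def esc_classify(x, y, n, delta):
    """Return 1 if the empirical distance exceeds delta / 2, else 0."""
    return 1 if esc_distance(x, y, n) > delta / 2 else 0


x = [0, 1] * 32   # alternating sequence
y = [0] * 64      # constant sequence
print(esc_classify(x, x, n=2, delta=0.5), esc_classify(x, y, n=2, delta=0.5))
```

Because the block length grows only like $\delta_0 \log N$, the estimate changes slowly with $N$, which is the property the later argument about the ESC relies on.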

It follows that for the vanishing memory class of processes $\mathcal{M}$, such an ESC is asymptotically
optimal since

$$\lim_{N\to\infty}\left[d(X, Y) - \frac{1}{N}[\log P(X) - \log Q(X)]\right] = 0$$

in probability. However, it is demonstrated below that the ESC is NOT essentially optimal.

Thus, not every universal classifier which is asymptotically optimal is also essentially optimal. 

In the following converse theorem it is demonstrated that no efficient classification is possible
(i.e. $\lambda(M) \approx 1$) if $N < N_0 2^{-\epsilon\ell}$.

Let the class $M_\ell \subset \mathcal{M}$ be the class of processes that are generated as in [2, p. 346, Proof
of Theorem 6]. Following the proof of [2, Theorem 6], we get the following converse theorem:



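Theorem 1: Let $Q, P \in M_\ell$ and let $N \le 2^{\ell(R-\epsilon)}$. Then, for all $\alpha, \epsilon, \Delta > 0$ and all $R \in (0, \log|\mathbf{A}|)$,
there exists a $\delta_0 = \delta_0(\alpha, \epsilon, \Delta, R)$ (sufficiently small) and an $\ell_0$ such that for all $\ell \ge \ell_0$ any
discriminator on $\mathcal{M}(R, \alpha, \delta_0, \ell)$ with parameters $N, \Delta, \lambda$ for which $N \le 2^{\ell(R-\epsilon)}$ must satisfy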
A(M) > 1 - e-^(^'*)^. 
Proof of Theorem 1: By Lemma A1 in [2], there exists a collection of cyclic subsets $A_i$ of $\ell$-vectors
from $\{0, 1\}^\ell$, each of size $2^{R_0\ell}$, and where, for some $\beta_0$ ($0 < \beta_0 < 1/2$), the Hamming distance
between any $x \in A_i$, $y \in A_j$ ($i \ne j$) satisfies $d_H(x, y) \ge \ell\beta_0$.

Construction of $M_\ell$: At time zero, choose an $\ell$-vector with a uniform distribution on a
cyclic set $A_i$. Repeat this $\ell$-vector $\nu$ times to create a $\nu\ell$-vector.

Next, add a $\nu'$-vector consisting of the first $\nu'$ elements of the first $\ell$-vector chosen, where $\nu'$
is uniformly distributed on $[1, \ell]$. Since the sets $A_i$ are cyclic, any length-$\ell$ substring of this vector
belongs to $A_i$. Thus, we have defined a random $(\nu\ell + \nu')$-vector. The process $P$ is the concatenation
of these sequences with a random phase uniformly distributed between $0$ and $(\ell\nu - 1)$, and dithered
by the addition modulo 2 of an i.i.d. "noise" vector $W$ with $\Pr(W_i = 1) = \delta$, $\Pr(W_i = 0) = 1 - \delta$
[2, page 346].
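The construction above can be sketched as a small sampler. This is an illustrative reading of the description, not the construction of [2] itself: a cyclic set is represented here by all cyclic shifts of one base word, and the names `cyclic_set`, `segment` and `sample_process` as well as the example values are assumptions.

```python
import random


def cyclic_set(base):
    """All cyclic shifts of the list `base`, as a set of tuples."""
    return {tuple(base[i:] + base[:i]) for i in range(len(base))}


def segment(base, nu, rng):
    """One (nu * ell + nu')-vector: a word repeated nu times plus a prefix."""
    ell = len(base)
    shift = rng.randrange(ell)
    word = base[shift:] + base[:shift]   # random member of the cyclic set
    nu_prime = rng.randint(1, ell)       # uniform on [1, ell]
    return word * nu + word[:nu_prime]


def sample_process(base, nu, delta, length, rng):
    """Concatenate segments, apply a random phase, then dither with noise."""
    ell = len(base)
    out = []
    while len(out) < length + ell * nu:
        out.extend(segment(base, nu, rng))
    phase = rng.randrange(ell * nu)      # random phase in [0, ell * nu - 1]
    clean = out[phase:phase + length]
    # Addition modulo 2 of i.i.d. noise bits with Pr(W_i = 1) = delta.
    return [bit ^ (rng.random() < delta) for bit in clean]


rng = random.Random(0)
x = sample_process([0, 0, 1, 1, 0, 1], nu=3, delta=0.05, length=60, rng=rng)
print(len(x))
```

With `delta = 0` every length-$\ell$ window inside a single segment is a cyclic shift of the base word, which matches the observation in the text that such substrings stay in $A_i$.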

By Lemma A1 in [2] it follows that by choosing $\delta$ to be small enough, the divergence $D_\ell(P_i\|P_j)$; $i \ne
j$ (as well as $D_N(P_i\|P_j)$ and $D(P_i\|P_j)$) for any two such processes can be made arbitrarily large.

At the same time, the number of processes in Mi is at least 2^*^ ''"^ while there are only 2^ X 
sequences to cope with Af^, and by derivation similar to those of [2, Eqs (A12) and (A12)], leading 
to to the conclusion that \f{M) < 1 - e'^'^^^O'^) if A^o < 2'^^-"^^, even if the measure Pj that 
governs Y, is given. 

Section 1: A ZMM-based classifier is essentially optimal 

It will now be demonstrated that a classifier which is based on a variant of ZMM [1] is essentially optimal.
Denote by $C_{77}(X_1^N)$ the number of phrases that are generated by the LZ77 parsing of $X_1^N$ (see
[4]). Thus, $C_{77}(X_1^N)$ denotes the number of distinct phrases that are generated by applying the
parsing procedure that is associated with LZ77, where each phrase is the longest incoming string
of yet unparsed letters that appears in the previously encoded data, extended by one letter.

Also, let $C_{77}(X_1^N|Y_1^N)$ be the number of phrases that are generated by cross-parsing of $X_1^N$
relative to $Y_1^N$. Thus, $C_{77}(X_1^N|Y_1^N)$ denotes the number of phrases that are generated by applying
the parsing procedure that is associated with LZ77, where in this case each phrase is the longest
incoming string of yet unparsed letters in $X_1^N$ that appears in $Y_1^N$, where the minimum phrase
length is set to be one.
Now, following [1], given two $N$-sequences $X_1^N$ and $Y_1^N$, let

$$d_{ZMM1}(X_1^N\|Y_1^N) = \frac{1}{N}[C_{77}(X_1^N|Y_1^N)\log N - C_{77}(X_1^N)\log N] \qquad (8)$$

Decide that $f_{ZMM1}(X, Y) = 1$ (i.e. $Q$ and $P$ are identical) if $d_{ZMM1}(X_1^N\|Y_1^N) \le \epsilon$. Otherwise, set
$f_{ZMM1}(X, Y) = 0$ (i.e. decide that $Q$ is different from $P$). The following Lemma states that the
classifier $f_{ZMM1}$ that is described above is asymptotically optimal over every finite class $M \subset \mathcal{M}$
of processes.
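The two parsing procedures and the resulting decision rule can be sketched as follows. This is a naive illustration that uses plain substring search, so it runs in far more than linear time; the function names are hypothetical, and "previously encoded data" is read here as the already parsed prefix with no overlap into the current phrase, which is an assumption. The labels follow the rule stated just above (1 for "identical", 0 for "different").

```python
from math import log2


def lz77_phrase_count(x):
    """Number of phrases when x is parsed against its own parsed prefix.

    Each phrase is the longest match found in x[:i], extended by one letter.
    """
    i, count, n = 0, 0, len(x)
    while i < n:
        length = 0
        while i + length < n and x[i:i + length + 1] in x[:i]:
            length += 1
        i += length + 1
        count += 1
    return count


def cross_phrase_count(x, y):
    """Number of phrases when x is parsed into longest substrings of y.

    A letter of x that never occurs in y forms a phrase of length one.
    """
    i, count, n = 0, 0, len(x)
    while i < n:
        length = 0
        while i + length < n and x[i:i + length + 1] in y:
            length += 1
        i += max(length, 1)
        count += 1
    return count


def d_zmm1(x, y):
    """(1/N) * [C(x|y) - C(x)] * log2(N), with N = len(x)."""
    n = len(x)
    return (cross_phrase_count(x, y) - lz77_phrase_count(x)) * log2(n) / n


def f_zmm1(x, y, eps):
    """1 if the empirical divergence is at most eps, else 0."""
    return 1 if d_zmm1(x, y) <= eps else 0


print(f_zmm1("abab", "abab", eps=0.1), f_zmm1("aaaa", "bbbb", eps=0.1))
```

For short strings the measure can be negative, as in the first example; the asymptotic statements in the text concern large $N$.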

Note that the ZMM measure that is used in [1] is slightly different, namely:

$$d_{ZMM}(X_1^N\|Y_1^N) = \frac{1}{N}[C_{77}(X_1^N|Y_1^N)\log N - C_{78}(X_1^N)\log N]$$

Lemma 1 Applying $f_{ZMM1}$ to any finite class $M \subset \mathcal{M}$ of processes yields

$$\limsup_{N\to\infty} \lambda(M) = 0$$

Proof of Lemma 1: Lemma 1 above follows directly from [1] for the case where $M$ is restricted
to be a finite class of finite-order Markov processes with positive transitions. However, here we deal
with the more general case where $M$ may be any finite subset of the vanishing-memory collection
$\mathcal{M}$. This calls for slight variations in the proofs that appear in [1].

Consider the vector X'^j^^ where ki,k2 are two arbitrary positive integers. Then, for any prob- 
ability measure P{.) G M( we have, by definition (positive transition property and strong mixing), 

P(X\) = P(X^,^,X0+1,X^) > lp(Xr,^^)(<5)'=P(X^) > ^P{Xl)P{X',-)6'^ 

and, 

PiX\) < mxZl){l - 5fP{X\-) 

< /3P(X,-'=)P(X^)(1 - 5f^j, < /3P(X\)P(X^) J, 

Rederiving Eq. (23), Eq. (26) and Eq. (32) in [1] for the more general "strong mixing" model that
is adopted here (replacing $l$ in Eq. (32) by $k$ and $n$ by $N$) yields:

El{z^) < [1 - ^VN-^^-i'^^^- - 1) (9) 

D Li 

Eq.(9) above replaces Eq.(23) in [1]. 

Also, the following equation replaces Eq.(26) in [1], 

- log P(z) > (1 - At) (c - 1) log iV - cf/e log ^ + log B) (10) 
where $c$ is defined in [1]. In a similar way, Eq. (32) in [1] is replaced by

c-l ^ ^ 

- log P(z) < - ^ log P(z^O + cfc[log ^ + ^ log B\ (11) 

i=l 

where $c$ is defined in [1]. Thus, Eq. (28) and Eq. (35) in [1] remain valid.

The proof then follows from steps that are similar to the steps that lead to part a) and part
b) of Theorem 1 in [1], and by the fact that $\lim_{N\to\infty}[-\frac{1}{N}\log P(X) - \frac{1}{N}C_{77}(X)\log N] = 0$, almost
surely.

After establishing the asymptotic optimality of the $f_{ZMM1}$ classifier we proceed to demonstrate its
essential optimality, as defined above. Consider again the class of processes $M_\ell$ that was used
in the proof of Theorem 6 in [2] and in the proof of Theorem 1 above and let $N_0 \ge 2^{\ell(R+\epsilon)}$. Then,

Theorem 2 For some small positive number e and for a large enough £ 

X{Me) < max P^[X, Y : -^[C77(X|Y) log ATq - C77(X) log A/^q] < e for some Q ^ P or 

P&Ml Ao 

^[C77(X|Y) logiVo - C77(X) logiVo >eandQ = P]< 0{j^^ 
where Q,P e M. 

Proof of Theorem 2: 
Parse $X$ to generate a concatenation of $\ell$-vectors (except, perhaps, for the last vector in the
generated concatenation), namely

where n is an integer satisfying Nq — 1 < n£ < Nq. Define 

S{X^, Y) = 1 if = Yi_^+^ for some z G [0, TVq - £ + 1] and 

S{X\Y)=0. otherwise. (13) 

Now, by Eq.(A12) in [2] 

Pi{X(\Xi G Aj) < Pr{\W\ > —YTl) ^ 2-^(-^o-^(*'"')) (14) 

2 \Aj\ 

where c{Po,S,i is defined in Eq. (A12b) in [2]. 

Thus, by the union bound, for any X(^ & Aj;j ^ i 

Pri6iXi,Y) = 1) < N^2-^ii^-<^°'^^^ (15) 

Since Nq = 2^^^+^\ and setting Rq = R-^ yields, 

U '5(4^;^',Y)<^2-^W'^0'^)-f) (16) 

where E{.) denotes expectation relative to Pi{-)- 

Also, since at least one LZ77 phrase must either begin or end in any ^-vector for which 
5{X{,Y) = 0, and by the Markov inequality it follows that 

P^[C77(X|Y) - 1, for some Q P;Q, P e Me] < ^[1 - 2-¥«M)-i)] < 2-|Wo,5))-f) ^^^^ 

Next, C77(X) is evaluated for any Q G M^. By construction of M( above, the number of LZ77 
phrases in each of the consecutive vi + z/'-letters substrings in X is no more then O(j^), almost 
all of which appear in the first ^-vector that then repeats itself. 
Hence,



Therefore, by Eqs (17) and (18) 



-^C77(X) < O(-r^) (18) 



P,[X, Y : i^[C77(X|Y) - C77(X)] < e jor some Q P;Q, P e M^] < 2-^W,5))-f) (19) 



for any < e < 1 and a large enough £. 

The last step of the proof of Theorem 2 demonstrates that the classification error vanishes also 
in the case where Q = P. 

Let 7o = 1 — S, and letz^o > 1 satisfy 702^0 < 1- Then, following the derivation of Eq.(67) in [2], 
for any Z : P{Z) > ^ 

Pr{S{Z, Y) = 0) < PinuQjoY + 2-'°^ = 2"*^^; for some k>0 (20) 

Now, by construction and by Eq. (AlO) in [2], each process in Mi consists of statistically indepen- 
dent "vectors" of length vi + v' bits, where the probability of each such vector is lower-bounded 
by: 

P{Xf+''') > 2-^^ (21) 
for large enough £, with probability 1 - Pr{\W\ > ^jijj) > 1 - 2-^^^+<M) 

Thus, by Eqs.(20) and(21), and since no more than one LZ77 phrases either starts or ends in 
any vector Z in X for which S{Z,Y) = and since no (z^^ -|- z^')-vector contains more than 0(^|^) 
LZ77 phrases, it follows that 



£;[C77(X|Y); Q = P;PeMe]<^[l + (2-^i^+<'^°'^^^ + 2-'=^)0(-^)] (22) 

log Ul 

Eqs. (19) and (22) and the Markov inequality, and choosing large enough u and £ lead to the 
completion of the proof of Theorem 2. 
An Empirical Statistics Classifier (ESC) is not essentially optimal

Let A'' = 2^(^~'^) and let the empirical measures Px) i^i) Qy{Zi) be based on the recurrence 
time of Zf in X and in Y utilizing Kac lemma as in [2] where ^ > Px)(-^") > ^" and ^ > 

Then , by [2, Eq.(68)], there exists a small positive 60 « 1 such that, 

Pr[logPxN{Z^) - logP(^f)| > neo for some G A"] < (2/3 + l)2"i°s^2-"^o < 2-"^(''o'^o) 

Thus, by the Markov inequality 



T-l T}ivii+^)n\ 



By Theorem 1 above, $\lambda(M_\ell) \approx 1$.

If the length of the test sequences is increased from $N = 2^{\ell(R-\epsilon)}$ to $N^* = 2^{\ell(R+\epsilon)}$, $n = \delta_0 \log N$
is only slightly increased to $n^* = \delta_0 \log N^*$, and therefore $n^* - n = 2\delta_0\epsilon\ell$.
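A short numerical check makes the point concrete. The parameter values below are arbitrary choices for illustration and the helper name is hypothetical; the only relation used is $n = \delta_0 \log N$ from the definition of the ESC.

```python
from math import isclose


def esc_block_lengths(ell, rate, eps, delta0):
    """Block lengths n and n* for N = 2^(ell(R - eps)) and N* = 2^(ell(R + eps))."""
    log_n = ell * (rate - eps)        # log2 of N
    log_n_star = ell * (rate + eps)   # log2 of N*
    return delta0 * log_n, delta0 * log_n_star


n, n_star = esc_block_lengths(ell=100, rate=0.5, eps=0.1, delta0=0.05)
# The sequence length grows by a factor of 2^(2 * eps * ell) = 2^20,
# while the block length grows only by 2 * delta0 * eps * ell.
print(n, n_star, isclose(n_star - n, 2 * 0.05 * 0.1 * 100))
```

With these values the sequence becomes about a million times longer while the block length used by the ESC grows by a single letter.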

By the ^-positive transition property , 

n* Q(^f ) Q(^P) "-0 

Thus, there is no abrupt change in the value of $d(X_1^N, Y_1^N)$ if the length is increased from $N$ to $N^*$.
Hence, $\lambda(M_\ell) \approx 1$ even if the length of the sequences is increased to $N_0$, for any $\Delta$ and for large $\ell$.
Also, it follows from the definition of the class M that 



vanishes as $N$ gets large and hence the ESC is asymptotically optimal.
However, as demonstrated above, it is NOT essentially optimal.
Section 2: A Variable length Fidelity Function $F_{VL}$



Let Li,jv,p(Xf ) = maxjrrb : P{xi) > jj where L^ax = O(logiV). 
Define: 



log log N 
Fn,vl{P, Q) = TTTT tynx] - TrTf Tynvi (^^) 



Observe that due to the positive transition property of the class M, Li^]si,p{Xl'^^""^'°) > 

log ^ 

and increases monotonically with N. 

Thus, there exists some large enough $L_{\min}(N_0)$ such that for $N \gg N_0$, each $L_{i,N,P}(X)$-vector
consists of a large number of $L_{\min}(N_0)$-vectors with a guard space of $k_0$ letters between any
two such consecutive vectors which, by the vanishing memory property, are approximately independent
of each other. Thus, with high probability, each $L_{i,N,P}(X)$-vector consists of about the same
composition of $L_{\min}(N_0)$-vectors.

It then follows by the central limit theorem that almost surely, relative to the P measure, 
limjVo-»oo li™Ar-+oo[7 — ry^i ~ rr ^"^^/•vA'a ] ~ ^• 
Similarly, almost surely, relative to the P measure, 

limjvo^oo liruN' -.00 { . ^' ^ ~ f f/°^^'fy^'J = ^■ 

Setting N' so as to make £^p[Li_jv',p = Ep[Li,n,q leads to the conclusion that 
Fn,vl{P-,Q) tends to D{P\\Q) almost surely and hence F^yi^P^Q) G F as required. 



An optimal universal $F_{VL}$-classifier



Consider the class M which was used in the proof of Theorem 1 above. 

It follows by Lemma A1 in [2] that just like $D_N(Q\|P)$, $F_{N,VL}(Q, P)$ can be made arbitrarily
large by selecting $\delta$ to be small enough. It then follows that Theorem 1 holds for the variable length
fidelity measure as well, and therefore if $N < N_0 2^{-\epsilon\ell}$, the probability of classification error must be
close to one for any universal classifier.

A universal classification algorithm that yields a negligible classification error for any $Q, P$ pair
for which $F_{N,VL}(Q, P) \ge \Delta > 0$ if $N \ge N_0 2^{\epsilon\ell}$ is now introduced. Similar to the ZMM in [1], it is
based on cross-parsing.

Let 

Li,jv(X) = maxjrr \j ' = ^t^^ for same 1 < i < y + 1 
Let $M$ be a positive integer satisfying $M = N 2^{-2\epsilon\ell}$, where $\epsilon$ is an arbitrarily small positive number.



Li,jv,M(X|Y) = maxjrr U ^ = ^t^^ for some ^+1 < t < - j and every l<k<^- 



and let 



L;v.q,m(X|Y) = -— Yl ^i,iv,M(X|Y) (24) 

_i V / — - — 



-'max 



and, 



2 ^max 

Lm.p{^) = W^, E ^iM^^ (25) 



2 -^maa; 



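Set $f_c(X, Y) = 0$ (i.e. $Q \ne P$) if:

$$\frac{\log N}{L_{N,Q,M}(X|Y)} - \frac{\log N}{L_{N,P}(X)} \ge \Delta + \epsilon$$

and $f_c(X, Y) = 1$ (i.e. $Q = P$) if:

$$\frac{\log N}{L_{N,Q,M}(X|Y)} - \frac{\log N}{L_{N,P}(X)} \le \epsilon$$

for a preset small positive number $\epsilon \ll \Delta$.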
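The decision step alone can be sketched as below, taking the two averaged match lengths as given inputs. How those averages are computed is defined by the preceding equations and is not reproduced here; the function name, argument names and the `None` return value for the range between the two thresholds are assumptions made for illustration.

```python
from math import log2


def vl_decision(l_cross, l_self, n, delta, eps):
    """Decision from averaged match lengths.

    l_cross: averaged cross match length of X relative to Y (assumed given)
    l_self:  averaged self match length of X (assumed given)
    Returns 0 (Q differs from P), 1 (Q equals P), or None when the
    statistic falls strictly between eps and delta + eps.
    """
    gap = log2(n) / l_cross - log2(n) / l_self
    if gap >= delta + eps:
        return 0
    if gap <= eps:
        return 1
    return None


print(vl_decision(l_cross=2.0, l_self=5.0, n=1024, delta=1.0, eps=0.1))
```

Shorter cross matches than self matches increase the first term, so a large gap indicates that $Y$ is a poor dictionary for $X$.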



Lemma 2 For any arbitrary small positive 6, there exists an Iq such that for any i > £q 

sup PriQ : \Li,N{^\Y) - Li,N,Q(X^^)| > <5] < L^s.Al^^2-'^<'''^o,S)e < g 

QeM 

for N = No2''^ and M = No2-¥ 
and therefore,

^ Lmax 



supF,[Q:|Lq,jv,m(X|Y)--— ^ i^i,iv,Q(Xf )| > ,5] < <5 



Also, 



@ J-'max j^—i 



The proof follows directly from Kac's Lemma and the properties of the class $\mathcal{M}$ (see Eq. (68a)
in [2]).

Lemma 3 Let $N = N_0 2^{\epsilon\ell}$. Then,
1) 

P,[|LAr.Q,M(X|Y) - £;pLl,^^,Q(X|Y)| > e] < 2-<^^^^^^^' 

2) 

Pr[\LN.p{^) - EpL,,nA^)\ > < 2-^('^0''^)^o 
for some c{ko,/3) > where /? < 22. 

Proof of Lemma 3: Parse $X$ into consecutive substrings of $n_0 + k_0 + L_{\max}$ letters each, where
$n_0 = K(k_0 + L_{\max})$; $K \gg 1$. Observe that by the vanishing memory property, the successive blocks
of $n_0$ letters are "almost" independent since they are separated by a guard space of $k_0 + L_{\max}$
letters and are governed, up to a factor that depends on $\beta$, by a $K$-th product memoryless probability measure
of $n_0$-vectors.
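The blocking used in this argument amounts to simple slicing. In the sketch below the helper name and the toy values are hypothetical; in the text the block length is $n_0 = K(k_0 + L_{\max})$ and the guard space is $k_0 + L_{\max}$.

```python
def guarded_blocks(x, n0, guard):
    """Blocks of n0 letters, each followed by `guard` skipped letters.

    Only complete blocks are returned; a trailing partial block is dropped.
    """
    step = n0 + guard
    return [x[i:i + n0] for i in range(0, len(x) - n0 + 1, step)]


k0, l_max, big_k = 1, 1, 2            # toy values, for illustration only
guard = k0 + l_max                    # guard space of k0 + L_max letters
n0 = big_k * guard                    # n0 = K * (k0 + L_max)
print(guarded_blocks("abcdefghijkl", n0, guard))
```

The skipped letters are what makes consecutive blocks approximately independent under the mixing condition, at the cost of discarding a fraction $1/(K+1)$ of the data.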

Also observe that 

L^.q(X|Y) = L^,Q{X^<^\Y) + Lr,.Q{X2tim 
and, 

L^.p(x) = L^A^r + LNAK'oti")- 
But,

< LN.Q{X2ti'\^) < Lmaxko and, 

< LN.p{XZtl') < ^rnaxko 

Setting no = , Nq = uq + ko + L^ax-, and /3 < 25 leads to Lemma 1 by applying the 

Chernoff bound for sums of i.i.d. bounded random variables. This leads to Lemma 3 and therefore
to the $F_{VL}$-optimality of the proposed algorithm, which has a computational complexity that is
proportional to $N \log N$.

In conclusion, it should be pointed out that by slightly modifying the ZMM algorithm in section 

N_ 

1 above by replacing C'j'j{X^) with 2C'j'j{X-^ ), and by changing the cross-parsing procedure 

that led to C^^{Xl \Y-[^ ) , where each phrase now is the longest incoming string of the yet unparsed 
letters that appears in all of the § M sub-blocks in Y, where M = NqI^^, one gets a universal 
classification algorithm, which by the same arguments that led to Lemma 2 and Lemma 3 above 
can be shown to be Fy^— optimal as well (at least for A > 4). However, following [4] where f 

was demonstrated to be an efficient entropy estimator, which was shown to converge to the entropy 
faster than one based on LZ77, it appears that the since the algorithm above utilizes 0{N) data 
points rather than 0{j^jq^ in the modified ZMM algorithm case, the latter may yield a classification 
error probability that converges to zero at a slower pace. 